The aim of the project is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations.
The dataset used for the analysis and visualisation is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.
This Twitter archive contains basic tweet data (tweet ID, timestamp, text, etc.) for 2356 of their 5000+ tweets as they stood on August 1, 2017. The archive does not contain the retweet count and favourite count for each tweet; these have been taken from another data file provided by Udacity. In addition, an image predictions file from Udacity, containing the top three dog breed predictions for the images in the tweets, has also been used for the analysis.
# Import required packages
import pandas as pd
import numpy as np
import requests
import tweepy
import os
import json
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from io import BytesIO
from PIL import Image
In the cells below, gather all three pieces of data for this project and load them into the notebook. Note: the method required to gather each piece of data is different.
# 'twitter_archive_enhanced.csv' file is directly downloaded from Udacity website
# reading twitter_archive_enhanced.csv into the dataframe
df_tw_enh_csv = pd.read_csv('twitter-archive-enhanced.csv',sep=',')
df_tw_enh_csv.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2356 entries, 0 to 2355 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2356 non-null int64 1 in_reply_to_status_id 78 non-null float64 2 in_reply_to_user_id 78 non-null float64 3 timestamp 2356 non-null object 4 source 2356 non-null object 5 text 2356 non-null object 6 retweeted_status_id 181 non-null float64 7 retweeted_status_user_id 181 non-null float64 8 retweeted_status_timestamp 181 non-null object 9 expanded_urls 2297 non-null object 10 rating_numerator 2356 non-null int64 11 rating_denominator 2356 non-null int64 12 name 2356 non-null object 13 doggo 2356 non-null object 14 floofer 2356 non-null object 15 pupper 2356 non-null object 16 puppo 2356 non-null object dtypes: float64(4), int64(3), object(10) memory usage: 313.0+ KB
df_tw_enh_csv.columns
Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
dtype='object')
df_tw_enh_csv.head(2)
| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None |
| 1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None |
# Request tsv file from the url (provided by Udacity) and check the request response
url_udacity = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url_udacity)
response
<Response [200]>
# extracting the tsv file name from url
file_name = url_udacity.split('/')[-1]
print(file_name)
image-predictions.tsv
# Write the response into tsv file
if not os.path.isfile(file_name):
with open (file_name, mode='wb') as file:
file.write(response.content)
# reading image-predictions.tsv into the dataframe
image_predic = pd.read_csv(file_name, sep='\t', encoding = 'utf-8')
image_predic.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2075 entries, 0 to 2074 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2075 non-null int64 1 jpg_url 2075 non-null object 2 img_num 2075 non-null int64 3 p1 2075 non-null object 4 p1_conf 2075 non-null float64 5 p1_dog 2075 non-null bool 6 p2 2075 non-null object 7 p2_conf 2075 non-null float64 8 p2_dog 2075 non-null bool 9 p3 2075 non-null object 10 p3_conf 2075 non-null float64 11 p3_dog 2075 non-null bool dtypes: bool(3), float64(3), int64(2), object(4) memory usage: 152.1+ KB
image_predic.sample(2)
| tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 963 | 705975130514706432 | https://pbs.twimg.com/media/CcwgjmuXIAEQoSd.jpg | 1 | Staffordshire_bullterrier | 0.587764 | True | American_Staffordshire_terrier | 0.281429 | True | bull_mastiff | 0.094798 | True |
| 1282 | 750383411068534784 | https://pbs.twimg.com/media/CmnluwbXEAAqnkw.jpg | 1 | Border_collie | 0.672791 | True | collie | 0.270188 | True | papillon | 0.034504 | True |
The code in the cell below has been taken from the Udacity website (file name: 'twitter_api.py'), but when executed, the following error message was displayed:
Forbidden: 403 Forbidden
453 - You currently have Essential access which includes access to Twitter API v2 endpoints only. If you need access to this endpoint, you’ll need to apply for Elevated access via the Developer Portal. You can learn more here: https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api#v2-access-leve
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

tweet_ids = df_tw_enh_csv.tweet_id.values
len(tweet_ids)

count = 0
fails_dict = {}
start = timer()
with open('tweet_json.txt', 'w') as outfile:
# This loop will likely take 20-30 minutes to run because of Twitter's rate limit
for tweet_id in tweet_ids:
count += 1
print(str(count) + ": " + str(tweet_id))
try:
tweet = api.get_status(tweet_id, tweet_mode='extended')
print("Success")
json.dump(tweet._json, outfile)
outfile.write('\n')
except tweepy.TweepError as e:
print("Fail")
fails_dict[tweet_id] = e
pass
end = timer()
print(end - start)
print(fails_dict)
The file 'tweet_json.txt' provided by Udacity has been used directly in this project.
# Reference: Loading and parsing a JSON file with multiple JSON objects
# https://stackoverflow.com/questions/12451431/loading-and-parsing-a-json-file-with-multiple-json-objects
df_tweet_json = pd.read_json('tweet-json.txt',lines=True)
df_tweet_json.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2354 entries, 0 to 2353 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 created_at 2354 non-null datetime64[ns, UTC] 1 id 2354 non-null int64 2 id_str 2354 non-null int64 3 full_text 2354 non-null object 4 truncated 2354 non-null bool 5 display_text_range 2354 non-null object 6 entities 2354 non-null object 7 extended_entities 2073 non-null object 8 source 2354 non-null object 9 in_reply_to_status_id 78 non-null float64 10 in_reply_to_status_id_str 78 non-null float64 11 in_reply_to_user_id 78 non-null float64 12 in_reply_to_user_id_str 78 non-null float64 13 in_reply_to_screen_name 78 non-null object 14 user 2354 non-null object 15 geo 0 non-null float64 16 coordinates 0 non-null float64 17 place 1 non-null object 18 contributors 0 non-null float64 19 is_quote_status 2354 non-null bool 20 retweet_count 2354 non-null int64 21 favorite_count 2354 non-null int64 22 favorited 2354 non-null bool 23 retweeted 2354 non-null bool 24 possibly_sensitive 2211 non-null float64 25 possibly_sensitive_appealable 2211 non-null float64 26 lang 2354 non-null object 27 retweeted_status 179 non-null object 28 quoted_status_id 29 non-null float64 29 quoted_status_id_str 29 non-null float64 30 quoted_status 28 non-null object dtypes: bool(4), datetime64[ns, UTC](1), float64(11), int64(4), object(11) memory usage: 505.9+ KB
# Taking only the data from required columns 'id','retweet_count','favorite_count' to a new dataframe
tweet_json = pd.DataFrame(data = df_tweet_json, columns = ['id','retweet_count','favorite_count'])
tweet_json.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2354 entries, 0 to 2353 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 2354 non-null int64 1 retweet_count 2354 non-null int64 2 favorite_count 2354 non-null int64 dtypes: int64(3) memory usage: 55.3 KB
tweet_json.to_csv('tweet_json_short.csv',index=False)
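The line-by-line approach from the Stack Overflow reference above can also be sketched directly with `json.loads`, assuming each line of the file is one complete JSON object (the sample data and field values below are made up for illustration):

```python
import io
import json

import pandas as pd

# A tiny stand-in for tweet-json.txt: one JSON object per line
# (field names mirror the Twitter API; the values are made up).
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 50}\n'
    '{"id": 2, "retweet_count": 20, "favorite_count": 75}\n'
)

records = []
for line in sample:
    tweet = json.loads(line)  # parse one JSON object per line
    records.append({
        'id': tweet['id'],
        'retweet_count': tweet['retweet_count'],
        'favorite_count': tweet['favorite_count'],
    })

df = pd.DataFrame(records)
```

This is equivalent in spirit to `pd.read_json(..., lines=True)`, but keeps explicit control over which fields are retained.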
In this section, detect and document at least eight (8) quality issues and two (2) tidiness issues. You must use both visual and programmatic assessment to assess the data.
Note: pay attention to the following key points when you assess the data.
df_tw_enh_csv.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2356 entries, 0 to 2355 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2356 non-null int64 1 in_reply_to_status_id 78 non-null float64 2 in_reply_to_user_id 78 non-null float64 3 timestamp 2356 non-null object 4 source 2356 non-null object 5 text 2356 non-null object 6 retweeted_status_id 181 non-null float64 7 retweeted_status_user_id 181 non-null float64 8 retweeted_status_timestamp 181 non-null object 9 expanded_urls 2297 non-null object 10 rating_numerator 2356 non-null int64 11 rating_denominator 2356 non-null int64 12 name 2356 non-null object 13 doggo 2356 non-null object 14 floofer 2356 non-null object 15 pupper 2356 non-null object 16 puppo 2356 non-null object dtypes: float64(4), int64(3), object(10) memory usage: 313.0+ KB
type(df_tw_enh_csv.timestamp[100])
str
1) There are 181 redundant retweet rows, identified by non-null values in 'retweeted_status_id', 'retweeted_status_user_id' and 'retweeted_status_timestamp'
2) There are 78 redundant reply rows, identified by non-null values in 'in_reply_to_status_id' and 'in_reply_to_user_id'
3) Incorrect data type for 'timestamp' (stored as string instead of datetime)
4) The column 'expanded_urls' has data for only 2297 rows; the remaining 59 rows (2356 - 2297 = 59) have no data
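The 'timestamp' data type issue can later be fixed with `pd.to_datetime`; a minimal sketch on made-up values in the same format as the archive column:

```python
import pandas as pd

# Made-up timestamps in the same "+0000" format as the archive column
df = pd.DataFrame({'timestamp': ['2017-08-01 16:23:56 +0000',
                                 '2015-11-15 22:32:08 +0000']})

# Convert the string column to timezone-aware datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'])
```

After conversion, datetime accessors such as `df['timestamp'].dt.year` become available for analysis.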
df_tw_enh_csv.describe()
| tweet_id | in_reply_to_status_id | in_reply_to_user_id | retweeted_status_id | retweeted_status_user_id | rating_numerator | rating_denominator | |
|---|---|---|---|---|---|---|---|
| count | 2.356000e+03 | 7.800000e+01 | 7.800000e+01 | 1.810000e+02 | 1.810000e+02 | 2356.000000 | 2356.000000 |
| mean | 7.427716e+17 | 7.455079e+17 | 2.014171e+16 | 7.720400e+17 | 1.241698e+16 | 13.126486 | 10.455433 |
| std | 6.856705e+16 | 7.582492e+16 | 1.252797e+17 | 6.236928e+16 | 9.599254e+16 | 45.876648 | 6.745237 |
| min | 6.660209e+17 | 6.658147e+17 | 1.185634e+07 | 6.661041e+17 | 7.832140e+05 | 0.000000 | 0.000000 |
| 25% | 6.783989e+17 | 6.757419e+17 | 3.086374e+08 | 7.186315e+17 | 4.196984e+09 | 10.000000 | 10.000000 |
| 50% | 7.196279e+17 | 7.038708e+17 | 4.196984e+09 | 7.804657e+17 | 4.196984e+09 | 11.000000 | 10.000000 |
| 75% | 7.993373e+17 | 8.257804e+17 | 4.196984e+09 | 8.203146e+17 | 4.196984e+09 | 12.000000 | 10.000000 |
| max | 8.924206e+17 | 8.862664e+17 | 8.405479e+17 | 8.874740e+17 | 7.874618e+17 | 1776.000000 | 170.000000 |
df_tw_enh_csv[df_tw_enh_csv['rating_denominator']!=10].count()
tweet_id 23 in_reply_to_status_id 5 in_reply_to_user_id 5 timestamp 23 source 23 text 23 retweeted_status_id 1 retweeted_status_user_id 1 retweeted_status_timestamp 1 expanded_urls 19 rating_numerator 23 rating_denominator 23 name 23 doggo 23 floofer 23 pupper 23 puppo 23 dtype: int64
df_tw_enh_csv['rating_numerator'].value_counts().sort_index(ascending=False)
1776 1 960 1 666 1 420 2 204 1 182 1 165 1 144 1 143 1 121 1 99 1 88 1 84 1 80 1 75 2 60 1 50 1 45 1 44 1 27 1 26 1 24 1 20 1 17 1 15 2 14 54 13 351 12 558 11 464 10 461 9 158 8 102 7 55 6 32 5 37 4 17 3 19 2 9 1 9 0 2 Name: rating_numerator, dtype: int64
1) There are 23 rows where the 'rating_denominator' is not equal to 10
2) The 'rating_numerator' for some rows is very high (e.g. 1776, 960) and seems incorrect
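One common way to cross-check suspect ratings is to re-extract them from the tweet text with a regular expression; a sketch on made-up texts (the pattern below is an assumption about the usual "N/10" form, not the project's definitive fix):

```python
import pandas as pd

# Made-up tweet texts containing ratings in the usual "N/10" form
texts = pd.Series([
    "This is Phineas. He's a mystical boy. 13/10",
    "Happy 4th of July. 1776/10",
])

# Capture the numerator (possibly decimal) and the denominator
ratings = texts.str.extract(r'(\d+(?:\.\d+)?)\/(\d+)')
ratings.columns = ['numerator', 'denominator']
```

Comparing the extracted values against the archive's 'rating_numerator' and 'rating_denominator' columns helps separate genuine jokes (like 1776/10) from extraction errors.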
sum(df_tw_enh_csv['source'].duplicated())
2352
df_tw_enh_csv.head(5)
| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None |
| 1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None |
| 2 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | None | None | None | None |
| 3 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Darla. She commenced a snooze mid meal... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | None | None | None | None |
| 4 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Franklin. He would like you to stop ca... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | None | None | None | None |
df_tw_enh_csv.tail(5)
| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2351 | 666049248165822465 | NaN | NaN | 2015-11-16 00:24:50 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a 1949 1st generation vulpix. Enj... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666049248... | 5 | 10 | None | None | None | None | None |
| 2352 | 666044226329800704 | NaN | NaN | 2015-11-16 00:04:52 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a purebred Piers Morgan. Loves to Netf... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666044226... | 6 | 10 | a | None | None | None | None |
| 2353 | 666033412701032449 | NaN | NaN | 2015-11-15 23:21:54 +0000 | <a href="http://twitter.com/download/iphone" r... | Here is a very happy pup. Big fan of well-main... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666033412... | 9 | 10 | a | None | None | None | None |
| 2354 | 666029285002620928 | NaN | NaN | 2015-11-15 23:05:30 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a western brown Mitsubishi terrier. Up... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666029285... | 7 | 10 | a | None | None | None | None |
| 2355 | 666020888022790149 | NaN | NaN | 2015-11-15 22:32:08 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a Japanese Irish Setter. Lost eye... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666020888... | 8 | 10 | None | None | None | None | None |
df_tw_enh_csv.sample(5)
| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1193 | 717537687239008257 | NaN | NaN | 2016-04-06 02:21:30 +0000 | <a href="http://twitter.com/download/iphone" r... | People please. This is a Deadly Mediterranean ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/717537687... | 11 | 10 | a | None | None | None | None |
| 1428 | 697616773278015490 | NaN | NaN | 2016-02-11 03:02:54 +0000 | <a href="http://twitter.com/download/iphone" r... | This pupper doubles as a hallway rug. Very rar... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/697616773... | 11 | 10 | None | None | None | pupper | None |
| 2259 | 667550904950915073 | NaN | NaN | 2015-11-20 03:51:52 +0000 | <a href="http://twitter.com" rel="nofollow">Tw... | RT @dogratingrating: Exceptional talent. Origi... | 6.675487e+17 | 4.296832e+09 | 2015-11-20 03:43:06 +0000 | https://twitter.com/dogratingrating/status/667... | 12 | 10 | None | None | None | None | None |
| 631 | 794205286408003585 | NaN | NaN | 2016-11-03 15:51:10 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Laika. She was a space pupper. The fir... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/794205286... | 14 | 10 | Laika | None | None | pupper | None |
| 851 | 765669560888528897 | NaN | NaN | 2016-08-16 22:00:23 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tove. She's a Balsamic Poinsetter. Sur... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/765669560... | 12 | 10 | Tove | None | None | None | None |
df_tw_enh_csv.name.value_counts().head(20)
None 745 a 55 Charlie 12 Cooper 11 Lucy 11 Oliver 11 Tucker 10 Penny 10 Lola 10 Winston 9 Bo 9 Sadie 8 the 8 Daisy 7 Buddy 7 Toby 7 an 7 Bailey 7 Leo 6 Oscar 6 Name: name, dtype: int64
1) There are 745 rows with the name entered as 'None', 55 rows with the name entered as 'a' and 7 rows with the name entered as 'an'
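Names starting with a lowercase letter (like 'a' and 'an') are extraction artifacts rather than real dog names; one way this could later be handled is to replace them, along with 'None', with NaN (a sketch on made-up values):

```python
import numpy as np
import pandas as pd

# Made-up 'name' values mirroring the patterns seen above
names = pd.Series(['Phineas', 'a', 'None', 'an', 'Tilly'])

# Replace 'None' and any all-lowercase token with NaN
cleaned = names.where(~(names.str.islower() | names.eq('None')), np.nan)
```

With NaN in place, counts like `cleaned.isna().sum()` report missing names honestly instead of treating 'a' as a name.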
image_predic.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2075 entries, 0 to 2074 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2075 non-null int64 1 jpg_url 2075 non-null object 2 img_num 2075 non-null int64 3 p1 2075 non-null object 4 p1_conf 2075 non-null float64 5 p1_dog 2075 non-null bool 6 p2 2075 non-null object 7 p2_conf 2075 non-null float64 8 p2_dog 2075 non-null bool 9 p3 2075 non-null object 10 p3_conf 2075 non-null float64 11 p3_dog 2075 non-null bool dtypes: bool(3), float64(3), int64(2), object(4) memory usage: 152.1+ KB
image_predic.head(5)
| tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
| 2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
| 3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
| 4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |
image_predic.tail(5)
| tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2070 | 891327558926688256 | https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg | 2 | basset | 0.555712 | True | English_springer | 0.225770 | True | German_short-haired_pointer | 0.175219 | True |
| 2071 | 891689557279858688 | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | 1 | paper_towel | 0.170278 | False | Labrador_retriever | 0.168086 | True | spatula | 0.040836 | False |
| 2072 | 891815181378084864 | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg | 1 | Chihuahua | 0.716012 | True | malamute | 0.078253 | True | kelpie | 0.031379 | True |
| 2073 | 892177421306343426 | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg | 1 | Chihuahua | 0.323581 | True | Pekinese | 0.090647 | True | papillon | 0.068957 | True |
| 2074 | 892420643555336193 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | 1 | orange | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False |
sum(image_predic.duplicated())
0
image_predic['p1_dog'].value_counts()
True 1532 False 543 Name: p1_dog, dtype: int64
image_predic['p2_dog'].value_counts()
True 1553 False 522 Name: p2_dog, dtype: int64
image_predic['p3_dog'].value_counts()
True 1499 False 576 Name: p3_dog, dtype: int64
image_predic[(image_predic['p1_dog'] == False) & (image_predic['p2_dog'] == False) & (image_predic['p3_dog'] == False)].count()
tweet_id 324 jpg_url 324 img_num 324 p1 324 p1_conf 324 p1_dog 324 p2 324 p2_conf 324 p2_dog 324 p3 324 p3_conf 324 p3_dog 324 dtype: int64
image_predic[(image_predic['p1_dog'] == False) & (image_predic['p2_dog'] == False) & (image_predic['p3_dog'] == False)]
| tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 666051853826850816 | https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg | 1 | box_turtle | 0.933012 | False | mud_turtle | 0.045885 | False | terrapin | 0.017885 | False |
| 17 | 666104133288665088 | https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg | 1 | hen | 0.965932 | False | cock | 0.033919 | False | partridge | 0.000052 | False |
| 18 | 666268910803644416 | https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg | 1 | desktop_computer | 0.086502 | False | desk | 0.085547 | False | bookcase | 0.079480 | False |
| 21 | 666293911632134144 | https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg | 1 | three-toed_sloth | 0.914671 | False | otter | 0.015250 | False | great_grey_owl | 0.013207 | False |
| 25 | 666362758909284353 | https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg | 1 | guinea_pig | 0.996496 | False | skunk | 0.002402 | False | hamster | 0.000461 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021 | 880935762899988482 | https://pbs.twimg.com/media/DDm2Z5aXUAEDS2u.jpg | 1 | street_sign | 0.251801 | False | umbrella | 0.115123 | False | traffic_light | 0.069534 | False |
| 2022 | 881268444196462592 | https://pbs.twimg.com/media/DDrk-f9WAAI-WQv.jpg | 1 | tusker | 0.473303 | False | Indian_elephant | 0.245646 | False | ibex | 0.055661 | False |
| 2046 | 886680336477933568 | https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg | 1 | convertible | 0.738995 | False | sports_car | 0.139952 | False | car_wheel | 0.044173 | False |
| 2052 | 887517139158093824 | https://pbs.twimg.com/ext_tw_video_thumb/88751... | 1 | limousine | 0.130432 | False | tow_truck | 0.029175 | False | shopping_cart | 0.026321 | False |
| 2074 | 892420643555336193 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | 1 | orange | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False |
324 rows × 12 columns
1) There are 2075 rows in the image predictions data, whereas 'twitter-archive-enhanced' has 2356 rows; data is missing for 2356 - 2075 = 281 rows
2) There are 324 rows where none of the three predictions identified a dog breed.
3) Multiple columns contain the same type of data (p1, p2 and p3 hold breed predictions; p1_conf, p2_conf and p3_conf hold confidence values; p1_dog, p2_dog and p3_dog hold boolean dog flags)
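The repeated p1/p2/p3 columns can later be collapsed by keeping, per row, the first prediction that is actually flagged as a dog; a minimal sketch with made-up rows (the helper name `first_dog` is illustrative, not from the source):

```python
import numpy as np
import pandas as pd

# Two made-up rows mirroring the p1/p2/p3 column structure
df = pd.DataFrame({
    'p1': ['orange', 'Chihuahua'], 'p1_conf': [0.10, 0.72], 'p1_dog': [False, True],
    'p2': ['bagel', 'malamute'],   'p2_conf': [0.08, 0.08], 'p2_dog': [False, True],
    'p3': ['papillon', 'kelpie'],  'p3_conf': [0.07, 0.03], 'p3_dog': [True, True],
})

def first_dog(row):
    # Return (breed, confidence) of the first prediction flagged as a dog
    for i in (1, 2, 3):
        if row[f'p{i}_dog']:
            return pd.Series([row[f'p{i}'], row[f'p{i}_conf']])
    return pd.Series([np.nan, np.nan])

df[['breed', 'confidence']] = df.apply(first_dog, axis=1)
```

Rows where all three flags are False (like the 324 identified above) end up with NaN, making them easy to filter out.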
tweet_json.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2354 entries, 0 to 2353 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 2354 non-null int64 1 retweet_count 2354 non-null int64 2 favorite_count 2354 non-null int64 dtypes: int64(3) memory usage: 55.3 KB
sum(tweet_json.duplicated())
0
tweet_json.head()
| id | retweet_count | favorite_count | |
|---|---|---|---|
| 0 | 892420643555336193 | 8853 | 39467 |
| 1 | 892177421306343426 | 6514 | 33819 |
| 2 | 891815181378084864 | 4328 | 25461 |
| 3 | 891689557279858688 | 8964 | 42908 |
| 4 | 891327558926688256 | 9774 | 41048 |
tweet_json.describe()
| id | retweet_count | favorite_count | |
|---|---|---|---|
| count | 2.354000e+03 | 2354.000000 | 2354.000000 |
| mean | 7.426978e+17 | 3164.797366 | 8080.968564 |
| std | 6.852812e+16 | 5284.770364 | 11814.771334 |
| min | 6.660209e+17 | 0.000000 | 0.000000 |
| 25% | 6.783975e+17 | 624.500000 | 1415.000000 |
| 50% | 7.194596e+17 | 1473.500000 | 3603.500000 |
| 75% | 7.993058e+17 | 3652.000000 | 10122.250000 |
| max | 8.924206e+17 | 79515.000000 | 132810.000000 |
1) There are 2354 rows in the tweet JSON data, whereas 'twitter-archive-enhanced' has 2356 rows; data is missing for 2356 - 2354 = 2 rows
There are 181 redundant retweet rows, identified by non-null 'retweeted_status_id', 'retweeted_status_user_id' and 'retweeted_status_timestamp' values
There are 78 redundant reply rows, identified by non-null 'in_reply_to_status_id' and 'in_reply_to_user_id' values
Incorrect data type for 'timestamp'
There are 23 rows where the 'rating_denominator' is not equal to 10
Dog names entered as 'a' and 'an' instead of 'None'
The 'rating_numerator' for some rows is very high and seems incorrect
Missing data: there are 2075 rows in the image predictions data, whereas 'twitter-archive-enhanced' has 2356 rows, so data is missing for 2356 - 2075 = 281 rows
There are 324 rows where none of the three predictions identified a dog breed; these rows are redundant
In this section, clean all of the issues you documented while assessing.
Note: Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of tidy data. The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
# Make copies of original pieces of data
tw_enh_csv_clean_df = df_tw_enh_csv.copy()
image_predic_clean_df = image_predic.copy()
tweet_json_clean_df = tweet_json.copy()
(Tidiness issue, Twitter archive data, dataframe = tw_enh_csv_clean_df)
The four columns 'doggo', 'floofer', 'pupper' and 'puppo' are to be combined into a single column 'stage', and the four original columns dropped.
# Merge multiple column values into one column in python pandas
# https://stackoverflow.com/questions/33098383/merge-multiple-column-values-into-one-column-in-python-pandas
tw_enh_csv_clean_df['stage'] = tw_enh_csv_clean_df[['doggo', 'floofer', 'pupper', 'puppo']].apply(lambda x: ''.join(x), axis=1)
tw_enh_csv_clean_df['stage'].value_counts()
NoneNoneNoneNone 1976 NoneNonepupperNone 245 doggoNoneNoneNone 83 NoneNoneNonepuppo 29 doggoNonepupperNone 12 NoneflooferNoneNone 9 doggoNoneNonepuppo 1 doggoflooferNoneNone 1 Name: stage, dtype: int64
# For the entries where no dog stage is mentioned in any of the 4 columns (i.e. 'None' in all 4 columns), replace the text with 'No-stage'
tw_enh_csv_clean_df['stage'].replace('NoneNoneNoneNone','No-stage',inplace=True)
tw_enh_csv_clean_df['stage'].value_counts()
No-stage 1976 NoneNonepupperNone 245 doggoNoneNoneNone 83 NoneNoneNonepuppo 29 doggoNonepupperNone 12 NoneflooferNoneNone 9 doggoNoneNonepuppo 1 doggoflooferNoneNone 1 Name: stage, dtype: int64
# removing the text 'None' from 'stage' column
tw_enh_csv_clean_df['stage'] = tw_enh_csv_clean_df['stage'].map(lambda x: x.replace("None",""))
tw_enh_csv_clean_df['stage'].value_counts()
No-stage 1976 pupper 245 doggo 83 puppo 29 doggopupper 12 floofer 9 doggopuppo 1 doggofloofer 1 Name: stage, dtype: int64
# For the rows where the dog stage is provided in more than one column, the combined value is rewritten with 'or'
tw_enh_csv_clean_df.loc[tw_enh_csv_clean_df.stage == 'doggopupper', 'stage'] = 'doggo or pupper'
tw_enh_csv_clean_df.loc[tw_enh_csv_clean_df.stage == 'doggopuppo', 'stage'] = 'doggo or puppo'
tw_enh_csv_clean_df.loc[tw_enh_csv_clean_df.stage == 'doggofloofer', 'stage'] = 'doggo or floofer'
tw_enh_csv_clean_df['stage'].value_counts()
No-stage 1976 pupper 245 doggo 83 puppo 29 doggo or pupper 12 floofer 9 doggo or puppo 1 doggo or floofer 1 Name: stage, dtype: int64
# drop the four columns 'doggo', 'floofer', 'pupper', 'puppo'
drop_columns_1 = ['doggo', 'floofer', 'pupper', 'puppo']
tw_enh_csv_clean_df = tw_enh_csv_clean_df.drop(drop_columns_1, axis=1)
assert not any(col in tw_enh_csv_clean_df.columns for col in drop_columns_1)
tw_enh_csv_clean_df.columns
Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
'rating_denominator', 'name', 'stage'],
dtype='object')
tw_enh_csv_clean_df.sample(5)
| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | stage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 354 | 831262627380748289 | NaN | NaN | 2017-02-13 22:03:49 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tucker. He's feeling h*ckin festive an... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/831262627... | 12 | 10 | Tucker | No-stage |
| 983 | 749395845976588288 | NaN | NaN | 2016-07-03 00:14:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is George. He just remembered that bees a... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/749395845... | 10 | 10 | George | No-stage |
| 829 | 768909767477751808 | NaN | NaN | 2016-08-25 20:35:48 +0000 | <a href="http://twitter.com/download/iphone" r... | RT @dog_rates: When it's Janet from accounting... | 7.001438e+17 | 4.196984e+09 | 2016-02-18 02:24:13 +0000 | https://twitter.com/dog_rates/status/700143752... | 10 | 10 | None | pupper |
| 1812 | 676811746707918848 | NaN | NaN | 2015-12-15 17:11:09 +0000 | <a href="http://twitter.com/download/iphone" r... | Say hello to Penny & Gizmo. They are pract... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/676811746... | 9 | 10 | Penny | No-stage |
| 1109 | 733828123016450049 | NaN | NaN | 2016-05-21 01:13:53 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Terry. The harder you hug him the fart... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/733828123... | 10 | 10 | Terry | No-stage |
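An alternative to the concatenate-then-strip approach above is to join only the non-'None' values in one step; a sketch on made-up rows mirroring the four stage columns:

```python
import pandas as pd

# Made-up rows mirroring the doggo/floofer/pupper/puppo columns
df = pd.DataFrame({
    'doggo':   ['None', 'doggo', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Join the stage names that are present; empty rows become 'No-stage'
df['stage'] = (
    df[stage_cols]
    .apply(lambda row: ' or '.join(v for v in row if v != 'None'), axis=1)
    .replace('', 'No-stage')
)
```

This produces the same 'doggo or pupper'-style values directly, without the intermediate 'doggopupper' strings.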
(Quality issue, Twitter archive data, dataframe = tw_enh_csv_clean_df)
Drop all rows containing retweets (identified by non-null values in the retweet columns), then drop the retweet columns from the dataframe.
The retweet columns are 'retweeted_status_id', 'retweeted_status_user_id' and 'retweeted_status_timestamp'.
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2356 entries, 0 to 2355 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2356 non-null int64 1 in_reply_to_status_id 78 non-null float64 2 in_reply_to_user_id 78 non-null float64 3 timestamp 2356 non-null object 4 source 2356 non-null object 5 text 2356 non-null object 6 retweeted_status_id 181 non-null float64 7 retweeted_status_user_id 181 non-null float64 8 retweeted_status_timestamp 181 non-null object 9 expanded_urls 2297 non-null object 10 rating_numerator 2356 non-null int64 11 rating_denominator 2356 non-null int64 12 name 2356 non-null object 13 stage 2356 non-null object dtypes: float64(4), int64(3), object(7) memory usage: 257.8+ KB
# keep only the rows where 'retweeted_status_id' is null, i.e. original tweets rather than retweets
tw_enh_csv_clean_df = tw_enh_csv_clean_df[tw_enh_csv_clean_df.retweeted_status_id.isna()]
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2175 entries, 0 to 2355 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2175 non-null int64 1 in_reply_to_status_id 78 non-null float64 2 in_reply_to_user_id 78 non-null float64 3 timestamp 2175 non-null object 4 source 2175 non-null object 5 text 2175 non-null object 6 retweeted_status_id 0 non-null float64 7 retweeted_status_user_id 0 non-null float64 8 retweeted_status_timestamp 0 non-null object 9 expanded_urls 2117 non-null object 10 rating_numerator 2175 non-null int64 11 rating_denominator 2175 non-null int64 12 name 2175 non-null object 13 stage 2175 non-null object dtypes: float64(4), int64(3), object(7) memory usage: 254.9+ KB
drop_columns_2 = ['retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp']
tw_enh_csv_clean_df = tw_enh_csv_clean_df.drop(drop_columns_2, axis = 1)
# check that every retweet column has actually been dropped
assert not any(col in tw_enh_csv_clean_df.columns for col in drop_columns_2)
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2175 entries, 0 to 2355 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2175 non-null int64 1 in_reply_to_status_id 78 non-null float64 2 in_reply_to_user_id 78 non-null float64 3 timestamp 2175 non-null object 4 source 2175 non-null object 5 text 2175 non-null object 6 expanded_urls 2117 non-null object 7 rating_numerator 2175 non-null int64 8 rating_denominator 2175 non-null int64 9 name 2175 non-null object 10 stage 2175 non-null object dtypes: float64(2), int64(3), object(6) memory usage: 203.9+ KB
(Quality issue, Twitter archive data, dataframe = tw_enh_csv_clean_df)
Find the rows that are replies (identified by non-null values in the in_reply_to columns), drop those rows, and then drop the in_reply_to columns from the dataframe
The in_reply_to columns are 'in_reply_to_status_id' and 'in_reply_to_user_id'
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2175 entries, 0 to 2355 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2175 non-null int64 1 in_reply_to_status_id 78 non-null float64 2 in_reply_to_user_id 78 non-null float64 3 timestamp 2175 non-null object 4 source 2175 non-null object 5 text 2175 non-null object 6 expanded_urls 2117 non-null object 7 rating_numerator 2175 non-null int64 8 rating_denominator 2175 non-null int64 9 name 2175 non-null object 10 stage 2175 non-null object dtypes: float64(2), int64(3), object(6) memory usage: 203.9+ KB
# find the index of rows with a non-null 'in_reply_to_status_id' and drop these rows
in_reply_to_index = tw_enh_csv_clean_df[tw_enh_csv_clean_df["in_reply_to_status_id"].notnull()].index
tw_enh_csv_clean_df.drop(axis=0, index=in_reply_to_index, inplace=True)
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2097 entries, 0 to 2355 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2097 non-null int64 1 in_reply_to_status_id 0 non-null float64 2 in_reply_to_user_id 0 non-null float64 3 timestamp 2097 non-null object 4 source 2097 non-null object 5 text 2097 non-null object 6 expanded_urls 2094 non-null object 7 rating_numerator 2097 non-null int64 8 rating_denominator 2097 non-null int64 9 name 2097 non-null object 10 stage 2097 non-null object dtypes: float64(2), int64(3), object(6) memory usage: 196.6+ KB
# drop the columns 'in_reply_to_status_id' and 'in_reply_to_user_id'
drop_columns_3 = ['in_reply_to_status_id','in_reply_to_user_id']
tw_enh_csv_clean_df = tw_enh_csv_clean_df.drop(drop_columns_3, axis = 1)
# check that both in_reply_to columns have actually been dropped
assert not any(col in tw_enh_csv_clean_df.columns for col in drop_columns_3)
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2097 entries, 0 to 2355 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2097 non-null int64 1 timestamp 2097 non-null object 2 source 2097 non-null object 3 text 2097 non-null object 4 expanded_urls 2094 non-null object 5 rating_numerator 2097 non-null int64 6 rating_denominator 2097 non-null int64 7 name 2097 non-null object 8 stage 2097 non-null object dtypes: int64(3), object(6) memory usage: 163.8+ KB
(Quality issue, Twitter archive data, dataframe = tw_enh_csv_clean_df)
Drop all rows with missing data in the 'expanded_urls' column
# updating the dataframe only with non empty rows of the column 'expanded_urls'
tw_enh_csv_clean_df = tw_enh_csv_clean_df[tw_enh_csv_clean_df.expanded_urls.notnull()]
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2094 entries, 0 to 2355 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2094 non-null int64 1 timestamp 2094 non-null object 2 source 2094 non-null object 3 text 2094 non-null object 4 expanded_urls 2094 non-null object 5 rating_numerator 2094 non-null int64 6 rating_denominator 2094 non-null int64 7 name 2094 non-null object 8 stage 2094 non-null object dtypes: int64(3), object(6) memory usage: 163.6+ KB
(Quality issue, Twitter archive data, dataframe = tw_enh_csv_clean_df)
Drop all rows where the 'rating_denominator' column is not equal to 10
tw_enh_csv_clean_df[tw_enh_csv_clean_df["rating_denominator"]!=10].count()
tweet_id 17 timestamp 17 source 17 text 17 expanded_urls 17 rating_numerator 17 rating_denominator 17 name 17 stage 17 dtype: int64
# updating the dataframe only with rows where the column 'rating_denominator' is equal to 10
tw_enh_csv_clean_df = tw_enh_csv_clean_df[tw_enh_csv_clean_df.rating_denominator==10]
tw_enh_csv_clean_df[tw_enh_csv_clean_df["rating_denominator"]!=10].count()
tweet_id 0 timestamp 0 source 0 text 0 expanded_urls 0 rating_numerator 0 rating_denominator 0 name 0 stage 0 dtype: int64
(Quality issue, Twitter archive data, dataframe = tw_enh_csv_clean_df)
Drop all rows where the 'rating_numerator' column is 30 or greater, treating these values as outliers
tw_enh_csv_clean_df['rating_numerator'].value_counts()
12 485 10 434 11 413 13 287 9 152 8 98 7 51 14 38 5 34 6 32 3 19 4 15 2 9 1 4 27 1 75 1 0 1 1776 1 26 1 420 1 Name: rating_numerator, dtype: int64
rating_numerator_index = tw_enh_csv_clean_df.rating_numerator >= 30
tw_enh_csv_clean_df[rating_numerator_index].rating_numerator.value_counts()
75 1 1776 1 420 1 Name: rating_numerator, dtype: int64
# updating the dataframe only with rows where the column 'rating_numerator' is less than 30
tw_enh_csv_clean_df = tw_enh_csv_clean_df[tw_enh_csv_clean_df.rating_numerator<30]
tw_enh_csv_clean_df['rating_numerator'].value_counts()
12 485 10 434 11 413 13 287 9 152 8 98 7 51 14 38 5 34 6 32 3 19 4 15 2 9 1 4 26 1 0 1 27 1 Name: rating_numerator, dtype: int64
(Quality issue, Twitter archive data, dataframe = tw_enh_csv_clean_df)
Change the 'timestamp' to datetime data type
tw_enh_csv_clean_df.timestamp = tw_enh_csv_clean_df.timestamp.astype('datetime64')
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2074 entries, 0 to 2355 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2074 non-null int64 1 timestamp 2074 non-null datetime64[ns] 2 source 2074 non-null object 3 text 2074 non-null object 4 expanded_urls 2074 non-null object 5 rating_numerator 2074 non-null int64 6 rating_denominator 2074 non-null int64 7 name 2074 non-null object 8 stage 2074 non-null object dtypes: datetime64[ns](1), int64(3), object(5) memory usage: 162.0+ KB
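Note that `astype('datetime64')` is rejected by newer pandas versions; `pd.to_datetime` is the more robust conversion and also understands the `+0000` offset in the archive's timestamps. A minimal sketch with hypothetical sample values:

```python
import pandas as pd

# hypothetical sample values in the archive's timestamp format
s = pd.Series(['2016-07-03 00:14:27 +0000', '2017-08-01 16:23:56 +0000'])

# utc=True yields a timezone-aware datetime64 column in one step
ts = pd.to_datetime(s, utc=True)
print(ts.dt.year.tolist())  # → [2016, 2017]
```

Dropping `utc=True` reproduces the naive `datetime64[ns]` dtype seen in the output above.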
Twitter API additional data (dataframe = tweet_json_clean_df) should be part of Twitter archive data (dataframe = tw_enh_csv_clean_df)
(Tidiness issue, Twitter archive data, dataframe = tw_enh_csv_clean_df)
Twitter API additional data (dataframe = tweet_json_clean_df) has some missing information
(Quality issue, Twitter API additional data, dataframe = tweet_json_clean_df)
Merge the 'retweet_count' and 'favorite_count' columns from 'tweet_json_clean_df' into 'tw_enh_csv_clean_df', remove any rows with missing information, and drop the redundant columns
tweet_json_clean_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2354 entries, 0 to 2353 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 2354 non-null int64 1 retweet_count 2354 non-null int64 2 favorite_count 2354 non-null int64 dtypes: int64(3) memory usage: 55.3 KB
# Merge the two dataframes
tw_enh_csv_clean_df = pd.merge(tw_enh_csv_clean_df, tweet_json_clean_df, left_on = ['tweet_id'], right_on = ['id'], how = 'left')
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2074 entries, 0 to 2073 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2074 non-null int64 1 timestamp 2074 non-null datetime64[ns] 2 source 2074 non-null object 3 text 2074 non-null object 4 expanded_urls 2074 non-null object 5 rating_numerator 2074 non-null int64 6 rating_denominator 2074 non-null int64 7 name 2074 non-null object 8 stage 2074 non-null object 9 id 2074 non-null int64 10 retweet_count 2074 non-null int64 11 favorite_count 2074 non-null int64 dtypes: datetime64[ns](1), int64(6), object(5) memory usage: 210.6+ KB
Note: As the columns 'tweet_id' and 'id' hold the same values, drop the redundant column 'id'
# drop the column 'id'
tw_enh_csv_clean_df = tw_enh_csv_clean_df.drop(['id'], axis = 1)
tw_enh_csv_clean_df.columns
Index(['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls',
'rating_numerator', 'rating_denominator', 'name', 'stage',
'retweet_count', 'favorite_count'],
dtype='object')
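A left join can silently produce NaN rows when a `tweet_id` has no match in the API data; pandas' `indicator=True` flag makes those rows easy to find. A sketch with toy stand-ins for the two dataframes:

```python
import pandas as pd

# toy stand-ins for the archive and the API data
left = pd.DataFrame({'tweet_id': [1, 2, 3]})
right = pd.DataFrame({'id': [1, 2], 'retweet_count': [10, 20], 'favorite_count': [100, 200]})

# indicator=True adds a '_merge' column flagging rows without an API match
merged = pd.merge(left, right, left_on='tweet_id', right_on='id', how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(len(unmatched))  # → 1
```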
Tweet image prediction data (dataframe = image_predic_clean_df) should be part of Twitter archive data (dataframe = tw_enh_csv_clean_df)
(Tidiness issue, Twitter archive data, dataframe = tw_enh_csv_clean_df)
Tweet image prediction data (dataframe = image_predic_clean_df) has some missing information
(Quality issue, Tweet image prediction data, dataframe = image_predic_clean_df)
Merge the 'image_predic_clean_df' dataframe into 'tw_enh_csv_clean_df', remove any rows with missing information, and drop the redundant columns
# merge the two dataframes on 'tweet_id'
tw_enh_csv_clean_df = pd.merge(tw_enh_csv_clean_df, image_predic_clean_df, on='tweet_id', how='left')
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2074 entries, 0 to 2073 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2074 non-null int64 1 timestamp 2074 non-null datetime64[ns] 2 source 2074 non-null object 3 text 2074 non-null object 4 expanded_urls 2074 non-null object 5 rating_numerator 2074 non-null int64 6 rating_denominator 2074 non-null int64 7 name 2074 non-null object 8 stage 2074 non-null object 9 retweet_count 2074 non-null int64 10 favorite_count 2074 non-null int64 11 jpg_url 1951 non-null object 12 img_num 1951 non-null float64 13 p1 1951 non-null object 14 p1_conf 1951 non-null float64 15 p1_dog 1951 non-null object 16 p2 1951 non-null object 17 p2_conf 1951 non-null float64 18 p2_dog 1951 non-null object 19 p3 1951 non-null object 20 p3_conf 1951 non-null float64 21 p3_dog 1951 non-null object dtypes: datetime64[ns](1), float64(4), int64(5), object(12) memory usage: 372.7+ KB
# drop the rows with missing data in the column 'jpg_url'
tw_enh_csv_clean_df = tw_enh_csv_clean_df[tw_enh_csv_clean_df.jpg_url.notnull()]
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1951 entries, 0 to 2073 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 1951 non-null int64 1 timestamp 1951 non-null datetime64[ns] 2 source 1951 non-null object 3 text 1951 non-null object 4 expanded_urls 1951 non-null object 5 rating_numerator 1951 non-null int64 6 rating_denominator 1951 non-null int64 7 name 1951 non-null object 8 stage 1951 non-null object 9 retweet_count 1951 non-null int64 10 favorite_count 1951 non-null int64 11 jpg_url 1951 non-null object 12 img_num 1951 non-null float64 13 p1 1951 non-null object 14 p1_conf 1951 non-null float64 15 p1_dog 1951 non-null object 16 p2 1951 non-null object 17 p2_conf 1951 non-null float64 18 p2_dog 1951 non-null object 19 p3 1951 non-null object 20 p3_conf 1951 non-null float64 21 p3_dog 1951 non-null object dtypes: datetime64[ns](1), float64(4), int64(5), object(12) memory usage: 350.6+ KB
# drop the column 'img_num' as it is not useful for this analysis
tw_enh_csv_clean_df = tw_enh_csv_clean_df.drop(['img_num'], axis = 1)
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1951 entries, 0 to 2073 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 1951 non-null int64 1 timestamp 1951 non-null datetime64[ns] 2 source 1951 non-null object 3 text 1951 non-null object 4 expanded_urls 1951 non-null object 5 rating_numerator 1951 non-null int64 6 rating_denominator 1951 non-null int64 7 name 1951 non-null object 8 stage 1951 non-null object 9 retweet_count 1951 non-null int64 10 favorite_count 1951 non-null int64 11 jpg_url 1951 non-null object 12 p1 1951 non-null object 13 p1_conf 1951 non-null float64 14 p1_dog 1951 non-null object 15 p2 1951 non-null object 16 p2_conf 1951 non-null float64 17 p2_dog 1951 non-null object 18 p3 1951 non-null object 19 p3_conf 1951 non-null float64 20 p3_dog 1951 non-null object dtypes: datetime64[ns](1), float64(3), int64(5), object(12) memory usage: 335.3+ KB
tw_enh_csv_clean_df.sample(5)
| tweet_id | timestamp | source | text | expanded_urls | rating_numerator | rating_denominator | name | stage | retweet_count | ... | jpg_url | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 307 | 826204788643753985 | 2017-01-30 23:05:46 | <a href="http://twitter.com/download/iphone" r... | Here's a little more info on Dew, your favorit... | http://us.blastingnews.com/news/2017/01/kentuc... | 13 | 10 | None | doggo | 1075 | ... | https://pbs.twimg.com/media/C3dEza1WcAAhlNU.jpg | Labrador_retriever | 0.782058 | True | golden_retriever | 0.156581 | True | soft-coated_wheaten_terrier | 0.007275 | True |
| 488 | 793150605191548928 | 2016-10-31 18:00:14 | <a href="http://twitter.com/download/iphone" r... | This is Nida. She's a free elf. Waited so long... | https://twitter.com/dog_rates/status/793150605... | 11 | 10 | Nida | No-stage | 1984 | ... | https://pbs.twimg.com/media/CwHWOZ7W8AAHv8S.jpg | Italian_greyhound | 0.193869 | True | bluetick | 0.160380 | True | standard_poodle | 0.125982 | True |
| 1617 | 674805413498527744 | 2015-12-10 04:18:42 | <a href="http://twitter.com/download/iphone" r... | When your entire life is crumbling before you ... | https://twitter.com/dog_rates/status/674805413... | 10 | 10 | None | No-stage | 391 | ... | https://pbs.twimg.com/ext_tw_video_thumb/67480... | English_springer | 0.594467 | True | cocker_spaniel | 0.389994 | True | Welsh_springer_spaniel | 0.007096 | True |
| 1497 | 678798276842360832 | 2015-12-21 04:44:55 | <a href="http://twitter.com/download/iphone" r... | This is Linda. She fucking hates trees. 7/10 h... | https://twitter.com/dog_rates/status/678798276... | 7 | 10 | Linda | No-stage | 1350 | ... | https://pbs.twimg.com/media/CWuTbAKUsAAvZHh.jpg | Airedale | 0.583122 | True | silky_terrier | 0.129567 | True | Lakeland_terrier | 0.094727 | True |
| 590 | 776113305656188928 | 2016-09-14 17:40:06 | <a href="http://twitter.com/download/iphone" r... | This is Sampson. He's about to get hit with a ... | https://twitter.com/dog_rates/status/776113305... | 11 | 10 | Sampson | pupper | 5068 | ... | https://pbs.twimg.com/media/CsVO7ljW8AAckRD.jpg | mousetrap | 0.777468 | False | black_widow | 0.093940 | False | paddlewheel | 0.017492 | False |
5 rows × 21 columns
Multiple columns contain the same type of data (columns p1, p2 and p3 contain the same type of data, columns p1_conf, p2_conf and p3_conf contain the same type of data and the columns p1_dog, p2_dog and p3_dog contain the same type of data)
(Tidiness issue, Twitter archive data, dataframe = tw_enh_csv_clean_df)
Create a single column 'Breed' holding the best dog-breed prediction (from p1, p2, p3) and another column 'Confidence' holding the corresponding confidence (from p1_conf, p2_conf, p3_conf), then drop the redundant columns
# Check if 'p1_conf' > 'p2_conf' > 'p3_conf'
tw_enh_csv_clean_df.query('p1_conf < p2_conf or p2_conf < p3_conf')
| tweet_id | timestamp | source | text | expanded_urls | rating_numerator | rating_denominator | name | stage | retweet_count | ... | jpg_url | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|
0 rows × 21 columns
# create empty lists to hold the best breed prediction (or a placeholder) and its confidence for each row
breed = []
confidence = []
# loop through the prediction columns to find the most confident prediction that is a dog breed
for i, row in tw_enh_csv_clean_df.iterrows():
    if row['p1_dog']:
        breed.append(row['p1'])
        confidence.append(row['p1_conf'])
    elif row['p2_dog']:
        breed.append(row['p2'])
        confidence.append(row['p2_conf'])
    elif row['p3_dog']:
        breed.append(row['p3'])
        confidence.append(row['p3_conf'])
    else:
        breed.append('No prediction')
        confidence.append(0)
# add lists created to the main dataframe
tw_enh_csv_clean_df['Breed'] = breed
tw_enh_csv_clean_df['Confidence'] = confidence
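The row-wise loop above can also be written as a vectorized `np.select`, which is typically much faster on larger frames. A sketch with toy rows standing in for the cleaned dataframe:

```python
import numpy as np
import pandas as pd

# toy rows standing in for tw_enh_csv_clean_df
df = pd.DataFrame({
    'p1': ['golden_retriever', 'mousetrap'], 'p1_conf': [0.80, 0.70], 'p1_dog': [True, False],
    'p2': ['Labrador_retriever', 'black_widow'], 'p2_conf': [0.10, 0.09], 'p2_dog': [True, False],
    'p3': ['pug', 'paddlewheel'], 'p3_conf': [0.05, 0.02], 'p3_dog': [True, False],
})

# np.select picks, per row, the choice for the first condition that is True
conditions = [df['p1_dog'], df['p2_dog'], df['p3_dog']]
df['Breed'] = np.select(conditions, [df['p1'], df['p2'], df['p3']], default='No prediction')
df['Confidence'] = np.select(conditions, [df['p1_conf'], df['p2_conf'], df['p3_conf']], default=0)
print(df[['Breed', 'Confidence']].values.tolist())
```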
tw_enh_csv_clean_df.columns
Index(['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls',
'rating_numerator', 'rating_denominator', 'name', 'stage',
'retweet_count', 'favorite_count', 'jpg_url', 'p1', 'p1_conf', 'p1_dog',
'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog', 'Breed',
'Confidence'],
dtype='object')
tw_enh_csv_clean_df.sample(5)
| tweet_id | timestamp | source | text | expanded_urls | rating_numerator | rating_denominator | name | stage | retweet_count | ... | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | Breed | Confidence | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1544 | 676948236477857792 | 2015-12-16 02:13:31 | <a href="http://twitter.com/download/iphone" r... | This is Karl. Karl thinks he's slick. 6/10 sne... | https://twitter.com/dog_rates/status/676948236... | 6 | 10 | Karl | No-stage | 980 | ... | 0.611603 | False | macaque | 0.135176 | False | squirrel_monkey | 0.083247 | False | No prediction | 0.000000 |
| 120 | 864873206498414592 | 2017-05-17 16:00:15 | <a href="http://twitter.com/download/iphone" r... | We only rate dogs. Please don't send in Jesus.... | https://twitter.com/dog_rates/status/864873206... | 14 | 10 | None | No-stage | 9361 | ... | 0.478616 | False | lakeside | 0.114182 | False | wreck | 0.055926 | False | No prediction | 0.000000 |
| 365 | 816450570814898180 | 2017-01-04 01:05:59 | <a href="http://twitter.com/download/iphone" r... | Meet Moose. He doesn't want his friend to go b... | https://twitter.com/dog_rates/status/816450570... | 13 | 10 | Moose | No-stage | 9366 | ... | 0.352857 | False | envelope | 0.060107 | False | nail | 0.031291 | False | No prediction | 0.000000 |
| 558 | 781251288990355457 | 2016-09-28 21:56:36 | <a href="http://twitter.com/download/iphone" r... | This is Oakley. He just got yelled at for goin... | https://twitter.com/dog_rates/status/781251288... | 11 | 10 | Oakley | No-stage | 2458 | ... | 0.887771 | True | Italian_greyhound | 0.030666 | True | seat_belt | 0.026730 | False | Mexican_hairless | 0.887771 |
| 1834 | 670449342516494336 | 2015-11-28 03:49:14 | <a href="http://twitter.com/download/iphone" r... | Vibrant dog here. Fabulous tail. Only 2 legs t... | https://twitter.com/dog_rates/status/670449342... | 5 | 10 | None | No-stage | 729 | ... | 0.999924 | False | European_gallinule | 0.000030 | False | agama | 0.000022 | False | No prediction | 0.000000 |
5 rows × 23 columns
# remove the individual prediction columns, which are now redundant
drop_columns_4 = ['p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog']
tw_enh_csv_clean_df.drop(drop_columns_4, axis = 1, inplace=True)
# check that the redundant prediction columns have actually been dropped
assert not any(col in tw_enh_csv_clean_df.columns for col in drop_columns_4)
tw_enh_csv_clean_df.columns
Index(['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls',
'rating_numerator', 'rating_denominator', 'name', 'stage',
'retweet_count', 'favorite_count', 'jpg_url', 'Breed', 'Confidence'],
dtype='object')
tw_enh_csv_clean_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1951 entries, 0 to 2073 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 1951 non-null int64 1 timestamp 1951 non-null datetime64[ns] 2 source 1951 non-null object 3 text 1951 non-null object 4 expanded_urls 1951 non-null object 5 rating_numerator 1951 non-null int64 6 rating_denominator 1951 non-null int64 7 name 1951 non-null object 8 stage 1951 non-null object 9 retweet_count 1951 non-null int64 10 favorite_count 1951 non-null int64 11 jpg_url 1951 non-null object 12 Breed 1951 non-null object 13 Confidence 1951 non-null float64 dtypes: datetime64[ns](1), float64(1), int64(5), object(7) memory usage: 293.2+ KB
Save the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
tw_enh_csv_clean_df.to_csv('twitter_archive_master.csv',index=False)
df_new_tw_enh_csv_clean = pd.read_csv('twitter_archive_master.csv',parse_dates=['timestamp'],encoding='utf-8',infer_datetime_format=True)
df_new_tw_enh_csv_clean.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1951 entries, 0 to 1950 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 1951 non-null int64 1 timestamp 1951 non-null datetime64[ns] 2 source 1951 non-null object 3 text 1951 non-null object 4 expanded_urls 1951 non-null object 5 rating_numerator 1951 non-null int64 6 rating_denominator 1951 non-null int64 7 name 1951 non-null object 8 stage 1951 non-null object 9 retweet_count 1951 non-null int64 10 favorite_count 1951 non-null int64 11 jpg_url 1951 non-null object 12 Breed 1951 non-null object 13 Confidence 1951 non-null float64 dtypes: datetime64[ns](1), float64(1), int64(5), object(7) memory usage: 213.5+ KB
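A quick way to confirm that the save/reload round-trip preserves both values and dtypes, sketched on a toy frame rather than the real master dataset:

```python
from io import StringIO

import pandas as pd

# toy frame standing in for the master dataset
df = pd.DataFrame({'tweet_id': [1, 2], 'timestamp': pd.to_datetime(['2017-08-01', '2016-07-03'])})

# write to an in-memory buffer and read it back with parse_dates
buf = StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf, parse_dates=['timestamp'])
print(df.equals(df2))  # → True
```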
# Some statistics of the data
df_new_tw_enh_csv_clean[['tweet_id','rating_numerator','retweet_count','favorite_count','Confidence']].describe()
| tweet_id | rating_numerator | retweet_count | favorite_count | Confidence | |
|---|---|---|---|---|---|
| count | 1.951000e+03 | 1951.000000 | 1951.000000 | 1951.000000 | 1951.000000 |
| mean | 7.361904e+17 | 10.554587 | 2782.882112 | 8967.771912 | 0.465732 |
| std | 6.771202e+16 | 2.232848 | 4707.842318 | 12308.752377 | 0.339514 |
| min | 6.660209e+17 | 0.000000 | 16.000000 | 81.000000 | 0.000000 |
| 25% | 6.758218e+17 | 10.000000 | 623.500000 | 1956.500000 | 0.140538 |
| 50% | 7.087381e+17 | 11.000000 | 1360.000000 | 4144.000000 | 0.460710 |
| 75% | 7.885890e+17 | 12.000000 | 3239.000000 | 11413.500000 | 0.776138 |
| max | 8.924206e+17 | 27.000000 | 79515.000000 | 132810.000000 | 0.999956 |
# Sort the values in descending order of 'favorite_count' and list the top 10
df_new_tw_enh_csv_clean[["tweet_id","rating_numerator","name","retweet_count","favorite_count"]].sort_values(by= 'favorite_count', axis = 0, ascending = False).head(10)
| tweet_id | rating_numerator | name | retweet_count | favorite_count | |
|---|---|---|---|---|---|
| 306 | 822872901745569793 | 13 | None | 48265 | 132810 |
| 764 | 744234799360020481 | 13 | None | 79515 | 131075 |
| 58 | 879415818425184262 | 13 | Duddles | 45849 | 107956 |
| 395 | 807106840509214720 | 13 | Stephan | 56625 | 107015 |
| 108 | 866450705531457537 | 13 | Jamesy | 32883 | 106827 |
| 329 | 819004803107983360 | 14 | Bo | 42228 | 95450 |
| 93 | 870374049280663552 | 13 | Zoey | 27680 | 85011 |
| 397 | 806629075125202948 | 13 | None | 37911 | 75639 |
| 134 | 859196978902773760 | 12 | quite | 25661 | 75193 |
| 798 | 739238157791694849 | 13 | None | 52360 | 75163 |
# Sort the values in descending order of 'retweet_count' and list the top 10
df_new_tw_enh_csv_clean[["tweet_id","rating_numerator","name","retweet_count","favorite_count"]].sort_values(by= 'retweet_count', axis = 0, ascending = False).head(10)
| tweet_id | rating_numerator | name | retweet_count | favorite_count | |
|---|---|---|---|---|---|
| 764 | 744234799360020481 | 13 | None | 79515 | 131075 |
| 395 | 807106840509214720 | 13 | Stephan | 56625 | 107015 |
| 798 | 739238157791694849 | 13 | None | 52360 | 75163 |
| 306 | 822872901745569793 | 13 | None | 48265 | 132810 |
| 58 | 879415818425184262 | 13 | Duddles | 45849 | 107956 |
| 329 | 819004803107983360 | 14 | Bo | 42228 | 95450 |
| 397 | 806629075125202948 | 13 | None | 37911 | 75639 |
| 630 | 761672994376806400 | 10 | None | 33421 | 55016 |
| 108 | 866450705531457537 | 13 | Jamesy | 32883 | 106827 |
| 1450 | 676219687039057920 | 10 | Kenneth | 31989 | 67100 |
df_new_tw_enh_csv_clean.plot(x = 'rating_numerator' ,y = 'favorite_count', kind = 'scatter')
<AxesSubplot:xlabel='rating_numerator', ylabel='favorite_count'>
df_new_tw_enh_csv_clean.plot(x = 'rating_numerator' ,y = 'retweet_count', kind = 'scatter')
<AxesSubplot:xlabel='rating_numerator', ylabel='retweet_count'>
Insight: There is no clear, direct relation between WeRateDogs' ratings and favorite or retweet counts (i.e. the dogs with the highest ratings do not necessarily have the highest favorite or retweet counts). In general, dogs rated 14/10 (the highest frequently occurring rating after removing outliers) have the highest favorite and retweet counts.
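One way to back this observation with a number is a rank correlation between rating and engagement. A sketch on hypothetical sample rows (the real call would use `df_new_tw_enh_csv_clean` directly):

```python
import pandas as pd

# hypothetical sample standing in for df_new_tw_enh_csv_clean
df = pd.DataFrame({
    'rating_numerator': [9, 10, 11, 12, 13, 14],
    'favorite_count': [2500, 4000, 3500, 9000, 20000, 15000],
    'retweet_count': [700, 1200, 900, 2800, 6000, 5000],
})

# Spearman correlation is robust to the skew in favorite/retweet counts
corr = df[['rating_numerator', 'favorite_count', 'retweet_count']].corr(method='spearman')
print(corr.round(2))
```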
# keep only the rows where a dog stage is provided
df_stage = df_new_tw_enh_csv_clean[df_new_tw_enh_csv_clean["stage"] != 'No-stage']
df_total_favorite_counts_by_stages = pd.DataFrame(df_stage.groupby('stage').favorite_count.sum().sort_values())
df_total_favorite_counts_by_stages['stages'] = df_total_favorite_counts_by_stages.index
df_total_favorite_counts_by_stages.plot(x="stages",y='favorite_count',kind='bar',title="Total favorite counts by stages of dogs",color='g')
<AxesSubplot:title={'center':'Total favorite counts by stages of dogs'}, xlabel='stages'>
df_total_retweet_counts_by_stages = pd.DataFrame(df_stage.groupby('stage').retweet_count.sum().sort_values())
df_total_retweet_counts_by_stages['stages'] = df_total_retweet_counts_by_stages.index
df_total_retweet_counts_by_stages.plot(x="stages",y='retweet_count',kind='bar',title="Total retweet counts by stages of dogs",color='g')
<AxesSubplot:title={'center':'Total retweet counts by stages of dogs'}, xlabel='stages'>
df_avg_rating_numerator_by_stages = pd.DataFrame(df_stage.groupby('stage').rating_numerator.mean().sort_values())
df_avg_rating_numerator_by_stages['stages'] = df_avg_rating_numerator_by_stages.index
df_avg_rating_numerator_by_stages.plot(x="stages",y='rating_numerator',kind='bar',title="Average rating by stage",color='g')
<AxesSubplot:title={'center':'Average rating by stage'}, xlabel='stages'>
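The three per-stage aggregations above can also be computed in a single named-aggregation call. A sketch with toy rows standing in for `df_stage`:

```python
import pandas as pd

# toy rows standing in for df_stage
df = pd.DataFrame({
    'stage': ['pupper', 'pupper', 'doggo', 'puppo'],
    'favorite_count': [100, 200, 400, 300],
    'retweet_count': [10, 20, 40, 30],
    'rating_numerator': [10, 11, 13, 12],
})

# one groupby pass produces all three summary columns
summary = df.groupby('stage').agg(
    total_favorites=('favorite_count', 'sum'),
    total_retweets=('retweet_count', 'sum'),
    avg_rating=('rating_numerator', 'mean'),
)
print(summary)
```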
The dog stage 'pupper' has the highest total favorite and retweet counts, although its average rating is slightly lower than that of the other three dog stages.
# exclude the rows with no breed prediction
df_30_most_tweeted_dogs = df_new_tw_enh_csv_clean.query('Breed != "No prediction"')
# group by 'Breed', count the tweets per breed and keep the 30 largest counts
response = df_30_most_tweeted_dogs.groupby('Breed').favorite_count.count().nlargest(30)
response
Breed golden_retriever 153 Labrador_retriever 103 Pembroke 94 Chihuahua 89 pug 62 toy_poodle 50 chow 47 Samoyed 42 Pomeranian 40 malamute 33 Chesapeake_Bay_retriever 31 French_bulldog 31 cocker_spaniel 30 miniature_pinscher 24 Eskimo_dog 22 German_shepherd 21 Cardigan 21 beagle 20 Staffordshire_bullterrier 20 Shih-Tzu 20 Siberian_husky 19 Shetland_sheepdog 18 Maltese_dog 18 Rottweiler 18 Lakeland_terrier 17 Italian_greyhound 17 kuvasz 17 basset 17 American_Staffordshire_terrier 16 West_Highland_white_terrier 15 Name: favorite_count, dtype: int64
plt.figure(figsize=(15,8))
response.plot(kind='bar', color=['orange'])
plt.title('Top 30 Dog Breeds most tweeted by WeRateDogs', fontsize=16)
plt.xlabel('Dog Breed')
plt.ylabel('Tweet Count');
golden_retriever is the most frequently tweeted dog breed by WeRateDogs.
# group by 'Breed', sum the favorite_count per breed and keep the top 10
response = df_30_most_tweeted_dogs.groupby('Breed').favorite_count.sum().nlargest(10)
response
Breed golden_retriever 1914145 Labrador_retriever 1253364 Pembroke 1038797 Chihuahua 717450 Samoyed 583906 French_bulldog 570921 chow 443610 cocker_spaniel 407412 pug 383817 toy_poodle 337270 Name: favorite_count, dtype: int64
plt.figure(figsize=(15,8))
response.plot(kind='bar', color=['red'])
plt.title('Top 10 Dog Breeds with overall highest favorite count', fontsize=16)
plt.xlabel('Dog Breed')
plt.ylabel('Overall favorite Count');
# .copy() avoids pandas' SettingWithCopyWarning when sorting in place
df_golden_retriever = df_new_tw_enh_csv_clean.query('Breed == "golden_retriever"').copy()
df_golden_retriever.sort_values(by = 'favorite_count', ascending = False, inplace = True)
df_golden_retriever.head(5)
| tweet_id | timestamp | source | text | expanded_urls | rating_numerator | rating_denominator | name | stage | retweet_count | favorite_count | jpg_url | Breed | Confidence | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 93 | 870374049280663552 | 2017-06-01 20:18:38 | <a href="http://twitter.com/download/iphone" r... | This is Zoey. She really likes the planet. Wou... | https://twitter.com/dog_rates/status/870374049... | 13 | 10 | Zoey | No-stage | 27680 | 85011 | https://pbs.twimg.com/media/DBQwlFCXkAACSkI.jpg | golden_retriever | 0.841001 |
| 444 | 795464331001561088 | 2016-11-07 03:14:10 | <a href="http://twitter.com/download/iphone" r... | Elder doggo does a splash. Both 13/10 incredib... | https://twitter.com/dog_rates/status/795464331... | 13 | 10 | None | doggo | 27728 | 55683 | https://pbs.twimg.com/ext_tw_video_thumb/79546... | golden_retriever | 0.193082 |
| 183 | 846514051647705089 | 2017-03-28 00:07:32 | <a href="http://twitter.com/download/iphone" r... | This is Barney. He's an elder doggo. Hitches a... | https://twitter.com/dog_rates/status/846514051... | 13 | 10 | Barney | doggo | 13076 | 48410 | https://pbs.twimg.com/media/C79sB4xXwAEvwKY.jpg | golden_retriever | 0.650003 |
| 39 | 883482846933004288 | 2017-07-08 00:28:19 | <a href="http://twitter.com/download/iphone" r... | This is Bella. She hopes her smile made you sm... | https://twitter.com/dog_rates/status/883482846... | 5 | 10 | Bella | No-stage | 10407 | 46860 | https://pbs.twimg.com/media/DELC9dZXUAADqUk.jpg | golden_retriever | 0.943082 |
| 45 | 882268110199369728 | 2017-07-04 16:01:23 | <a href="http://twitter.com/download/iphone" r... | This is Alfy. You're witnessing his first wate... | https://twitter.com/dog_rates/status/882268110... | 13 | 10 | Alfy | No-stage | 12118 | 45880 | https://pbs.twimg.com/media/DD5yKdPW0AArzX8.jpg | golden_retriever | 0.762211 |
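Sorting by `favorite_count` and then taking `head(5)` can also be expressed with `DataFrame.nlargest`, which avoids mutating the frame at all; a sketch on toy data (the names and counts below are just stand-ins):

```python
import pandas as pd

# Toy stand-in for the golden retriever subset
df = pd.DataFrame({"name": ["Zoey", "Barney", "Bella", "Alfy"],
                   "favorite_count": [85011, 48410, 46860, 45880]})

# Top 2 rows by favourite count, returned as a new frame
top = df.nlargest(2, "favorite_count")
print(top["name"].tolist())  # ['Zoey', 'Barney']
```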
# Display the favourite count and image for each of the five
# most-favourited golden retriever tweets
from IPython.display import display

for i in range(5):
    print("favorite_count = ", df_golden_retriever.favorite_count.iloc[i])
    r = requests.get(df_golden_retriever.jpg_url.iloc[i])
    display(Image.open(BytesIO(r.content)))
favorite_count =  85011
favorite_count =  55683
favorite_count =  48410
favorite_count =  46860
favorite_count =  45880
# Count the number of tweets per breed; every column of the count()
# result holds the same per-breed total
df_dog_rating = df_new_tw_enh_csv_clean.groupby('Breed').count()
df_dog_rating
| Breed | tweet_id | timestamp | source | text | expanded_urls | rating_numerator | rating_denominator | name | stage | retweet_count | favorite_count | jpg_url | Confidence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghan_hound | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Airedale | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
| American_Staffordshire_terrier | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
| Appenzeller | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Australian_terrier | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| toy_poodle | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
| toy_terrier | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| vizsla | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 |
| whippet | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
| wire-haired_fox_terrier | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
114 rows × 13 columns
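Since `count()` repeats the same per-breed total in every column, `groupby(...).size()` yields the same information as a single Series; a minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned tweet archive
df = pd.DataFrame({"Breed": ["pug", "pug", "golden_retriever", "vizsla"],
                   "favorite_count": [10, 20, 30, 40]})

# size() returns one count per group instead of one count per column
counts = df.groupby("Breed").size()
print(counts.sort_values(ascending=False).head())
```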
# Ten breeds with the fewest tweets (despite the column name, the values
# here are per-breed tweet counts, not ratings)
df_dog_rating['rating_numerator'].sort_values().head(10)
Breed
Bouvier_des_Flandres       1
Irish_wolfhound            1
standard_schnauzer         1
Japanese_spaniel           1
Scotch_terrier             1
silky_terrier              1
EntleBucher                1
clumber                    1
Sussex_spaniel             2
black-and-tan_coonhound    2
Name: rating_numerator, dtype: int64
# Number of breeds that appear in fewer than 10 tweets
(df_dog_rating[df_dog_rating['rating_numerator'] < 10]).count()
#(df_new_tw_enh_csv_clean[df_new_tw_enh_csv_clean['rating_numerator'] < 10]).count()
tweet_id              61
timestamp             61
source                61
text                  61
expanded_urls         61
rating_numerator      61
rating_denominator    61
name                  61
stage                 61
retweet_count         61
favorite_count        61
jpg_url               61
Confidence            61
dtype: int64
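The same figure can be read off more directly by summing a boolean mask over the per-breed counts, since `True`/`False` sum as 1/0; a sketch with made-up counts:

```python
import pandas as pd

# Made-up per-breed tweet counts for illustration
counts = pd.Series({"pug": 2, "golden_retriever": 50,
                    "vizsla": 13, "clumber": 1})

# Summing the boolean mask counts the breeds under the threshold
n_rare = (counts < 10).sum()
print(n_rare)  # 2
```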
Several dog breeds appear in only a single tweet, and 61 breeds are represented by fewer than ten tweets each.